(54) Improved microprocessor

(57) A digital system is provided with several processors, a private level 1 cache associated with each
processor, a shared level 2 cache having several segments per entry, and a level 3 physical memory. The
shared level 2 cache architecture is embodied with 4-way associativity, four segments per entry and four
valid and dirty bits. When the level 2 cache misses, the penalty to access data within the level 3 memory is
high. The system supports miss under miss to let a second miss interrupt a segment prefetch being done in
response to a first miss. Thus, an interruptible SDRAM to L2-cache prefetch system with miss under miss
support is provided. A shared translation lookaside buffer (TLB) is provided for level two accesses, while a
private TLB is associated with each processor. A micro TLB (µTLB) is associated with each resource that can
initiate a memory transfer. The level 2 cache, along with all of the TLBs and µTLBs, have resource ID fields
and task ID fields associated with each entry to allow flushing and cleaning based on resource or task.
Configuration circuitry is provided to allow the digital system to be configured on a task by task basis in
order to reduce power consumption.
Description

[0001] This invention generally relates to microprocessors, and more specifically to improvements in direct memory 
access circuits, systems, and methods of making. 

[0002] Microprocessors are general purpose processors which provide high instruction throughputs in order to execute
software running thereon, and can have a wide range of processing requirements depending on the particular
software applications involved. A direct memory access (DMA) controller is often associated with a processor in order
to take over the burden of transferring blocks of data from one memory or peripheral resource to another and to thereby
improve the performance of the processor.

[0003] Many different types of processors are known, of which microprocessors are but one example. For example,
Digital Signal Processors (DSPs) are widely used, in particular for specific applications, such as mobile processing
applications. DSPs are typically configured to optimize the performance of the applications concerned and to achieve
this they employ more specialized execution units and instruction sets. Particularly in applications such as mobile
telecommunications, but not exclusively, it is desirable to provide ever increasing DSP performance while keeping
power consumption as low as possible.

[0004] To further improve performance of a digital system, two or more processors can be interconnected. For example,
a DSP may be interconnected with a general purpose processor in a digital system. The DSP performs numeric
intensive signal processing algorithms while the general purpose processor manages overall control flow. The two
processors communicate and transfer data for signal processing via shared memory.

[0005] Particular and preferred aspects of the invention are set out in the accompanying independent and dependent
claims. Combinations of features from the dependent claims may be combined with features of the independent claims
as appropriate and not merely as explicitly set out in the claims.

[0006] In accordance with a first aspect of the invention, there is provided a digital system having several processors,
a private level 1 cache associated with each processor, a shared level 2 cache having several segments per entry,
and a level 3 physical memory. The shared level 2 cache architecture is embodied with 4-way associativity, four
segments per entry and four valid and dirty bits. When the level 2 cache misses, the penalty to access data within the
level 3 memory is high. The system supports miss under miss to let a second miss interrupt a segment prefetch being
done in response to a first miss.

[0007] In another embodiment, a shared translation lookaside buffer (TLB) is provided for level two accesses, while
a private TLB is associated with each processor. A micro TLB (µTLB) is associated with each resource that can initiate
a memory transfer. The level 2 cache, along with all of the TLBs and µTLBs, have resource ID fields and task ID fields
associated with each entry to allow flushing and cleaning based on resource or task.

[0008] In another embodiment, configuration circuitry is provided to allow the digital system to be configured on a
task by task basis in order to reduce power consumption.

[0009] Particular embodiments in accordance with the invention will now be described, by way of example only, and
with reference to the accompanying drawings in which like reference signs are used to denote like parts and in which
the Figures relate to the digital system of Figure 1 and in which:

Figure 1 is a block diagram of a digital system that includes an embodiment of the present invention in a megacell
core having multiple processor cores;

Figure 2 is a more detailed block diagram of the megacell core of Figure 1;

Figure 3 is a block diagram illustrating a shared translation lookaside buffer (TLB) and several associated
micro-TLBs (µTLB) included in the megacell of Figure 2;

Figure 4 illustrates an entry in the TLB and µTLBs of Figure 3;

Figure 5 is a block diagram illustrating a smart cache that has a cache and a RAM set, versions of which are
included in each of the processor cores of Figure 1;

Figure 6 is an illustration of loading a single line into the RAM set of Figure 5;

Figure 7 is an illustration of loading a block of lines into the RAM set of Figure 5;

Figure 8 is an illustration of interrupting a block load of the RAM set according to Figure 7 in order to load a single
line within the block;

Figure 9 is an illustration of a level 2 cache in the megacell of Figure 1 that has an interruptible prefetch system,
and thereby provides miss under miss support;

Figure 10 illustrates concurrent access of the level 2 cache and level 2 RAM of the megacell of Figure 1;

Figure 11A illustrates a request queue for the level 2 memory system of Figure 10;

Figure 11B is a more detailed block diagram of the level 2 memory system of Figure 10, illustrating the request
queue;

Figure 12 illustrates tag circuitry with task ID and resource ID fields in the level 2 cache of the megacell of Figure 2;

Figure 13 is a block diagram illustrating monitoring circuitry within the megacell of Figure 2 to manage cleaning
and flushing based on average miss rate measure;

Figure 14 is a block diagram illustrating a priority register in each processor of the megacell of Figure 2 for task
based priority arbitration;

Figure 15 is a block diagram of the level 1 caches in the megacell of Figure 2 illustrating control circuitry for
interruptible block prefetch and clean functions;

Figure 16 is a block diagram of the cache of Figure 15 illustrating a source/destination register for DMA operation;

Figure 17 illustrates operation of the cache of Figure 16 using only a global valid bit for DMA completion status;

Figure 18 illustrates operation of the cache of Figure 15 in which a block of lines is cleaned or flushed;

Figure 19 illustrates register and arbitration circuitry in the cache of Figure 15 to support local memory with DMA
operation simultaneously with RAM set operation in the same RAM set;

Figure 20 illustrates use of a local valid bit to support concurrent DMA and CPU access in the cache of Figure 15;

Figure 21 illustrates operation of the TLB of Figure 3 for selective flushing of an entry for a given task or resource;

Figure 22 illustrates control circuitry for adaptive replacement of TLB entries in the TLB of Figure 3;

Figure 23 is a block diagram of control circuitry in the megacell of Figure 2 for dynamic control of power management
systems using task attributes;

Figure 24 illustrates dynamic hardware configuration of the megacell of Figure 2 using task attributes;

Figure 25 illustrates task based event profiling to perform task scheduling for control of power dissipation within
the system of Figure 1;

Figure 26 illustrates operation of the level 2 TLB of Figure 2 while being shared between different operating systems;

Figure 27 illustrates operation of the level 1 TLB of Figure 2 while being shared between different memory access
requestors operating under a common operating system;

Figure 28A is a representation of a telecommunications device incorporating an embodiment of the present
invention; and

Figure 28B is a block diagram representation of the telecommunications device of Figure 28A.

[0010] Corresponding numerals and symbols in the different figures and tables refer to corresponding parts unless
otherwise indicated.

[0011] Although the invention finds particular application to Digital Signal Processors (DSPs), implemented, for example,
in an Application Specific Integrated Circuit (ASIC), it also finds application to other forms of processors. An
ASIC may contain one or more megacells which each include custom designed functional circuits combined with
pre-designed functional circuits provided by a design library.

[0012] Figure 1 is a block diagram of a digital system that includes an embodiment of the present invention in a
megacell core having multiple processor cores. In the interest of clarity, Figure 1 only shows those portions of
microprocessor 100 that are relevant to an understanding of an embodiment of the present invention. Details of general
construction for DSPs are well known, and may be found readily elsewhere. For example, U.S. Patent 5,072,418 issued
to Frederick Boutaud, et al, describes a DSP in detail. U.S. Patent 5,329,471 issued to Gary Swoboda, et al, describes
in detail how to test and emulate a DSP. Details of portions of microprocessor 100 relevant to an embodiment of the
present invention are explained in sufficient detail herein below, so as to enable one of ordinary skill in the
microprocessor art to make and use the invention.

[0013] Referring again to Figure 1, a megacell (Atlas Core) includes a control processor (MPU) and a digital signal
processor (DSP) that share a block of memory (RAM) and a cache. A traffic control block (Atlas traffic control) receives
transfer requests from a memory access node in the host processor and transfer requests from a memory access node
in the DSP. The traffic control block interleaves these requests and presents them to the memory and cache. Shared
peripherals are also accessed via the traffic control block. A direct memory access controller (DMA) can transfer data
between an external source and the shared memory. Various application specific processors or hardware accelerators
(HW Acc) can also be included within the megacell as required for various applications and interact with the DSP and
MPU via the traffic control block.

[0014] External to the megacell, a second traffic control block (System Traffic Controller) is connected to receive
memory requests from the Atlas traffic control block in response to explicit requests from the DSP or MPU, or from
misses in the shared cache. Off-chip memory (external) and/or on-chip memory is connected to the system traffic
controller. A frame buffer (local frame buffer) and a display device (Display) are connected to the system traffic controller
to receive data for displaying graphical images. A host processor (system host) interacts with the resources on the
megacell via the system traffic controller. A set of peripherals (DSP Private) are connected to the DSP, while a set of
peripherals (MPU Private) are connected to the MPU.

[0015] Figure 2 is a more detailed block diagram of the megacell core of Figure 1. The DSP includes a local memory
(Loc RAM) and data cache (D-C), a smart cache that is configured as instruction cache (I-C) and a block of memory
(RAM set). The DSP is connected to the traffic controller via a level 2 interface (L2 IF) that also includes a translation
lookaside buffer (TLB). A DMA circuit is also included within the DSP. Individual micro TLBs (µTLB) are associated
with the DMA circuit, data cache and instruction cache.

[0016] Similarly, the MPU (Concorde) includes a local memory (Loc RAM) and data cache (D-C), a smart cache that
is configured as instruction cache (I-C) and a block of memory (RAM set). The MPU is connected to the traffic controller
via a level 2 interface (L2 IF) that also includes a TLB. A DMA circuit is also included within the MPU. Individual micro
TLBs (µTLB) are associated with the DMA circuit, data cache and instruction cache.

[0017] The L2 traffic controller includes a TLB and a micro-TLB (µTLB) that is associated with the DMA block (Sys
DMA).

Memory Management Unit 


[0018] At the megacell traffic controller level, all addresses are physical. They have been translated from virtual to
physical at the processor sub-system level by a memory management unit (MMU) associated with each core. At the
processor level, access permission, supplied through MMU page descriptors, is also checked, while at the megacell
level protection between processors is enforced by other means, which will be described in more detail later.

[0019] An address reference is generally located within the µTLB or main TLB of each processor sub-system; however,
some references, such as those used by DMA/Host/... to access megacell memories, can be distributed within
the L2 traffic controller and cached into the L2 system shared TLB. Because system performance is very sensitive to the
TLB architecture and size, it is important to implement efficient TLB control commands to flush, lock or unlock entries
when a task is created or deleted without degrading the execution of other tasks. Therefore, each µTLB and each
cache entry holds a task-ID, also called ASID. During execution, the current task-ID register is compared with the µTLB
entry; this also provides better security, as will be described later. During MMU operation, commands are supplied to
flush locked or unlocked entries of a µTLB corresponding to a selected task.

[0020] To provide maximum flexibility, the MMU is based on a software table walk, backed up by TLB caches both
at the processor sub-system and megacell level. This allows easy addition of new page size support or new page
descriptor information if required. A TLB miss initiates an MMU handler routine to load the missing reference into the
TLB. At the megacell (ATLAS Core) level, the TLB miss is routed via the system interrupt router to the processor having
generated the missing reference or to the processor in charge of the global memory management.

[0021] The MMU provides cachability and bufferability attributes for all levels of memory. The MMU also provides
the information "Shared" to indicate that a page is shared among multiple processors (or tasks). This bit, as standalone
or combined with the task-ID, allows specific cache and TLB operation on data shared between processors and/or tasks.

[0022] All megacell memory accesses are protected by a TLB. As they all have different requirements in terms of
access frequencies and memory size, a shared TLB approach has been chosen to reduce the system cost at the
megacell level. This shared TLB is programmable by each processor. The architecture provides enough flexibility to
let the platform work with independent operating systems (OS) or a distributed OS with a unified memory
management.

[0023] The present embodiment supports page sizes of 1K, 4K, 64K and 1MB, but other embodiments might have
TLB hardware supporting other additional page sizes.

[0024] The organization of the data structures supporting the memory management descriptor is left open since each
TLB miss is resolved by a software TLB-miss handler. These data structures include the virtual-to-physical address
translation and all additional descriptors to manage the memory hierarchy. The list of these descriptors and their function
is described below. Table 1 includes a set of memory access permission attributes. In other embodiments, a processor
may have other modes that enable access to memory without permission checks.



Table 1 - Memory Access Permission

Supervisor    User          Description
----------    ----------    -----------
No access     No access
Read only     No access
Read only     Read only
Read/Write    No access
Read/Write    Read only
Read/Write    Read/Write

Execute Never: provides access permission to protect data memory areas from being executed. This information can
be combined with the access permission described above or kept separate.

Shared: indicates that this page may be shared by multiple tasks across multiple processors.

Cachability: Various memory entities such as individual processor's cache and write buffer, and shared cache and write
buffer, are managed through the MMU descriptor. The options included in the present embodiment are as follows: Inner
cachable, Outer cachable, Inner write through/write back, Outer write through/write back, and Outer write allocate.
The terms inner and outer refer to levels of caches that are built in the system. The boundary between inner and
outer is defined in a specific embodiment, but inner will always include the level 1 cache. In a system with 3 levels of caches,
the inner corresponds to level 1 and level 2 cache and the outer corresponds to level 3 due to existing processor systems.
In the present embodiment, inner is the level 1 cache and outer is the level 2 cache.

Device: all accesses to this type of location must occur in program order. All device regions must also be marked as
non-cacheable. Accesses to device memory are kept in their size (no burst).

Blocking/non blocking: determines if a write must be acknowledged on completion of the write (D-ack) or as soon as the
write transaction has been acknowledged (T-ack).

Endianism: determines, on a page basis, the endianness of the transfer.

MMU/TLB control operation 

[0025] Figure 3 is a block diagram illustrating a shared translation lookaside buffer (TLB) and several associated
micro-TLBs (µTLB) included in the megacell of Figure 2. On a µTLB miss, the shared TLB is first searched. In case of
a hit on the shared TLB, the µTLB that missed is loaded with the entry content of the shared TLB. In case of a
miss, the shared TLB generates an interrupt to the processor whose OS supervises the resource which caused the
miss, and both shared TLB and µTLBs are loaded. The priority on this shared TLB is managed in the same way as priority
on memory access. One or more resources can be using the shared TLB. One or more resources can program the
shared TLB. The replacement algorithm in the shared TLB is under hardware control. However, in an embodiment
in which the system has a master CPU with a distributed OS, this master CPU could also bypass the replacement
algorithm by selecting a victim entry, reading and writing directly to the shared TLB.
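
The lookup order of paragraph [0025] can be sketched in a few lines of Python. This is an illustrative software model only, not the hardware described here: the dictionary-based TLBs, the function name `translate`, the callback `software_miss_handler` (standing in for the interrupt to the supervising processor) and the sample page numbers are all assumptions introduced for the example.

```python
# Illustrative model only: dict-based TLBs and all names here are assumptions.

def translate(vpage, micro_tlb, shared_tlb, software_miss_handler):
    """Return the physical page for vpage, refilling TLB levels on misses."""
    # 1. The micro-TLB of the requesting resource is searched first.
    if vpage in micro_tlb:
        return micro_tlb[vpage]
    # 2. On a micro-TLB miss, the shared TLB is searched.
    if vpage in shared_tlb:
        micro_tlb[vpage] = shared_tlb[vpage]  # load the micro-TLB that missed
        return micro_tlb[vpage]
    # 3. On a shared-TLB miss, the supervising processor is interrupted
    #    (modelled as a callback) and both TLB levels are loaded.
    ppage = software_miss_handler(vpage)
    shared_tlb[vpage] = ppage
    micro_tlb[vpage] = ppage
    return ppage


page_table = {0x10: 0x80, 0x11: 0x81}  # hypothetical software page table
micro, shared = {}, {0x10: 0x80}
first = translate(0x10, micro, shared, page_table.__getitem__)   # shared-TLB hit
second = translate(0x11, micro, shared, page_table.__getitem__)  # miss at both levels
```

After the two calls, both translations are present in the micro-TLB model, and the second one has also been installed in the shared TLB model.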

[0026] A resource identifier is loaded into the TLB along with the task-ID. Resource-IDs and task-IDs are not extension
fields of the virtual address (VA) but simply address qualifiers. With the task-ID, all entries in a TLB belonging to a
specific task can be identified. They can, for instance, be invalidated altogether through a single operation without
affecting the other tasks. Similarly, the resource ID is required because task-ID numbers on the different processors
might not be related; therefore, task related operations must be, in some cases, restricted to a resource-ID. The TLB
cache also includes the "shared" bit and a lock bit for each of its entries. At system initialization, all R-ID and Task-ID
registers distributed across the system are set to zero, meaning that the system behaves as if there were no such fields.
[0027] Figure 4 illustrates an entry in the TLB and µTLBs of Figure 3. A processor can initiate the following operations:
Invalidate entry with VA, Invalidate all entries related to a Task-ID, Invalidate all entries related to an R-ID, Invalidate
all shared entries, Invalidate all entries, Lock/Unlock entry, and Lock/Unlock all entries related to a Task-ID/R-ID.

Invalidate entry: The virtual address (VA), the associated task identifier and the resource identifier, in the following
format, are stored at a specific address. This generates an entry invalidate operation on the corresponding address,
task-ID and R-ID. Note that all processors of the megacell might not be allowed to invalidate entries other than their
own. In that case, the R-ID field is replaced by the R-ID which comes from the R-ID processor register along with the
address.

Invalidate all entries related to a task (Task-ID): This operation invalidates all entries corresponding to the provided
task identifier. Note that all processors of the megacell might not be allowed to invalidate entries related to a task other
than the ones they manage. This provides, however, the capability to a master processor to free space from the shared
TLB by invalidating all entries of a task belonging to another processor. This operation invalidates all the entries
corresponding to the provided task and resource identifier or to a task of the resource requesting the operation.

Invalidate all entries related to a resource (R-ID): This operation invalidates all entries corresponding to the provided
resource identifier. Note that all processors of the megacell might not be allowed to invalidate entries other than their
own. This provides, however, the capability to a master processor to free space from the shared TLB by invalidating
all entries of another processor.

Invalidate all shared entries: This operation invalidates all entries in the TLB marked as shared of the requester.
Similarly, the R-ID limits the effect of this operation like the above operations.

Invalidate all entries: This operation invalidates all entries in the TLB matching the R-ID of the requester. If all R-ID
registers distributed in the system are equal, this operation invalidates all entries. In addition, as we might still want
to have a different R-ID for the DMA engine than for CPUs, a global bit also allows the R-ID comparison to be enabled
or disabled.

Lock/unlock entry: "Lock/unlock entry" is done by providing the VA, task-ID and R-ID of the entry which needs to be
locked/unlocked. Restriction on R-ID applies as above.

Lock/unlock all entries related to a task: Lock/unlock is done by providing the task identifier which needs to be
locked/unlocked. Restriction on R-ID applies as above.
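
The selective invalidate operations listed above can be illustrated with a small software model. The sketch below is an assumption-laden illustration rather than the described circuitry: the `TlbEntry` fields, the `invalidate` function and its parameters (`task_id`, `shared_only`, and `check_r_id`, which plays the role of the global bit that enables or disables the R-ID comparison) are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    # Hypothetical entry layout: only the qualifier fields used below.
    va: int
    task_id: int
    r_id: int
    shared: bool = False
    valid: bool = True


def invalidate(tlb, requester_r_id, task_id=None, shared_only=False,
               check_r_id=True):
    """Invalidate matching entries and return how many were invalidated."""
    count = 0
    for entry in tlb:
        if not entry.valid:
            continue
        if check_r_id and entry.r_id != requester_r_id:
            continue  # restricted to entries of the requesting resource
        if task_id is not None and entry.task_id != task_id:
            continue  # restricted to the selected task
        if shared_only and not entry.shared:
            continue  # restricted to entries marked as shared
        entry.valid = False
        count += 1
    return count


tlb = [
    TlbEntry(va=0x1000, task_id=1, r_id=0),
    TlbEntry(va=0x2000, task_id=1, r_id=1),
    TlbEntry(va=0x3000, task_id=2, r_id=0, shared=True),
]
# Resource 0 invalidates its own entries for task 1.
own = invalidate(tlb, requester_r_id=0, task_id=1)
# With the R-ID comparison disabled, task 1 entries of any resource match.
master = invalidate(tlb, requester_r_id=0, task_id=1, check_r_id=False)
```

In this model the first call touches only the entry owned by resource 0, while the second call, with the comparison disabled, also reaches the task 1 entry that belongs to resource 1; the shared entry of task 2 is left valid throughout.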

[0028] In the case in which an independent OS is running on each processor, each OS can initiate the above operations.
In that case, these operations must be restricted to entries with a resource identifier (R-ID) belonging to the
requester.

[0029] In the case of a single master OS, task and memory management can be viewed as unified, removing the
need of R-ID. The R-ID can be an extension of the task-ID and as such it comes out of each core or it is hard-coded
for each processor, in which case R-ID comparison must be configurable as enable/disable. The former provides more
flexibility and removes some complexity in the TLB management: disabling the R-ID is equivalent to having a single
R-ID for all the system or for part of the system.

[0030] A global control bit also determines if all the above functions must be limited to the entry corresponding
to the resource ID requesting the operation.

[0031] Although it is preferable to have the same page size for memory management on all processors, it is not 
mandatory. In a shared system, the TLB supports all page sizes of the system. 

[0032] Other operations, which more specifically support the software TLB handler, provide access to the shared system
TLB: Read TLB entry, Write TLB entry, Check (and select victim TLB entry), and Set victim TLB entry.
Read TLB entry: Read entry pointed by the victim pointer into the TLB entry register.
Write TLB entry: The content of the TLB entry registers is only written to the TLB if no match occurs.
Check (and select victim TLB entry): the check and select operation has multiple functions. Its first purpose is to
determine an index value for the replacement of an entry. However, it can also be used to find out if an entry is already
in the TLB.

[0033] The check and select operation starts from the victim pointer current value (there might be a random algorithm
done in hardware working in background). During the search, if none of the entries matches, the victim pointer takes
the value of the first index that follows the current index value and which is not locked. If all TLB entries are locked, a
flag is raised. If a matching entry is found, the victim entry points to this matching entry and a flag is raised.
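
A minimal sketch of the check and select operation of paragraph [0033] is given below. It is an interpretation under stated assumptions, not the hardware algorithm: entries are modelled as `(key, locked)` tuples, the search wraps around the end of the table, the pointer is left unchanged when every entry is locked, and the function name `check_and_select` and its return tuple are hypothetical.

```python
def check_and_select(entries, victim, key):
    """Model of check-and-select on a list of (key, locked) tuples.

    Returns (new_victim, match_flag, all_locked_flag).
    Wrap-around search order is an assumption of this sketch.
    """
    n = len(entries)
    # A matching entry wins: the victim points at it and a flag is raised.
    for offset in range(n):
        i = (victim + offset) % n
        if entries[i][0] == key:
            return i, True, False
    # No match: take the first unlocked index following the current one.
    for offset in range(1, n + 1):
        i = (victim + offset) % n
        if not entries[i][1]:
            return i, False, False
    # Every entry is locked: raise the flag and leave the pointer unchanged.
    return victim, False, True


table = [("A", True), ("B", False), ("C", True), ("D", False)]
found = check_and_select(table, victim=1, key="C")
replaced = check_and_select(table, victim=1, key="Z")
full = check_and_select([("A", True), ("B", True)], victim=0, key="Z")
```

The three calls exercise the three outcomes named in the text: a match, selection of the next unlocked index, and the case in which all entries are locked.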


Shared cache and RAM 

[0034] The megacell includes large shared memories working as a secondary level of cache (L2 Cache) or RAM (L2
RAM). This level of memory is preferably called the outer level, as each processor includes an inner level 1 memory
subsystem within the memory hierarchy. The megacell outer memory is organized as what is called a SmartCache,
allowing concurrent accesses on cache and RAMset. RAMset is a block of RAM that has aspects of cache behavior
and cache control operations as well as DMA capability. The SmartCache architecture provides predictable behavior
and enhanced real-time performance while keeping high flexibility and ease of use.

[0035] MEGACELL "outer" memory can be shared between MEGACELL internal processors and external Host processors
or peripherals. RAM usage can also be restricted to the usage of a single processor thanks to the MMU mechanism.
However, a need might arise in the MEGACELL to add additional physical protection per processor on some
part of MEGACELL memory to overwrite the MMU intrinsic protection.

[0036] The MEGACELL unified shared cache architecture of this embodiment is a four way set associative cache with
segmented lines to reduce system latency. All outer memories are connected as unified instruction/data memory to
avoid compiler restrictions such as data in program space or vice-versa. The size of this cache or the number of associated
RAMsets may vary in other embodiments.

[0037] RAMset control registers are memory mapped and therefore also benefit from the protection provided by the
MMU. However, this would force operations on cache or any specific RAMset to be on separate pages for protection
reasons. Therefore, a control register is provided to configure how and by which CPU the various parts of MEGACELL
memory are controlled. All CPUs can execute operations such as cache flushing or cache cleaning as these operations
will be restricted by the resource identifier field located in the TAG area of the cache.

[0038] Figure 5 is a block diagram illustrating a smart cache that has a cache and a RAM set, versions of which are
included in each of the processor cores of Figure 1. As discussed above, the SmartCache is composed of a 4-way
set-associative cache (TAG Array and Data array 2-5) and one or more additional RAM sets (Data array 0, 1). The
RAMset can be configured as a cache extension or as a block of RAM. When configured as RAM, a loading mechanism
is provided by a separate DMA engine to optimize data transfer required by multimedia applications.

[0039] In the embodiment of Figure 5, the RAM set has two different sized data arrays (Data array 0, Data array 1);
however, other embodiments may specify all RAM sets with the same size to simplify the hardware logic and the
software model.

[0040] Each RAM set has an associated TAG register (Full set TAG) containing the base address of the RAM set
and a global valid bit (VG) in addition to an individual valid bit (Valid), referred to as VI, for each line (entry). In the
present embodiment, RAM set lines have the same size as the cache lines; however, in other embodiments longer
lines can also be used to reduce the number of VI bits.

[0041] A control bit (cache control) for each RAMset provides the capability to configure it as a cache extension
(RAM set) or as a RAM.

[0042] When configured as a RAM, this memory does not overlap with other memories in the system. This memory
space is mapped as non-cacheable at the outer level. The RAM control logic (address decode) generates a hit
equivalent signal, which prevents the outer cache from fetching the missing data/instruction from the external memory. The
VG bit acts as an enable/disable. It is set when the set_base_addr is written to and cleared when the RAM is invalidated
or disabled.

[0043] If the base address register of the RAM is programmed in such a way that the associated memory area overlaps
the external memory, coherency is not guaranteed by hardware.

[0044] When configured as a cache (SmartCache), control circuitry (hit/miss logic) generates hit/miss signals called
hit-hit and hit-miss for each RAMset. A hit-hit is generated when a valid entry of the RAM set matches the address
provided by the core. An entry is valid when both VG and its VI are set. A hit-miss signal is generated when the base
address of the RAM is valid (VG = 1) and matches the top address provided by the core but the selected entry in the
RAM set has its VI equal to zero.
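
The hit-hit and hit-miss conditions of paragraph [0044] reduce to a short decision function. The sketch below is only an illustration: the function name `ramset_lookup`, the representation of the Full set TAG as the top address bits in `base`, and the example geometry (16-byte lines, four lines per RAM set) are assumptions not taken from this description.

```python
def ramset_lookup(addr, base, vg, vi, line_bits, index_bits):
    """Classify an access to one RAM set as 'hit-hit', 'hit-miss' or 'miss'.

    base: top address bits held in the Full set TAG (assumed encoding)
    vg:   global valid bit; vi: list of per-line individual valid bits
    """
    index = (addr >> line_bits) & ((1 << index_bits) - 1)
    top = addr >> (line_bits + index_bits)
    if not vg or top != base:
        return "miss"  # RAM set disabled, or address outside its range
    # Base matches and VG = 1: the line's VI bit decides between the two.
    return "hit-hit" if vi[index] else "hit-miss"


# Hypothetical geometry: 16-byte lines, 4 lines, RAM set based at 0x1000.
LINE_BITS, INDEX_BITS = 4, 2
BASE = 0x1000 >> (LINE_BITS + INDEX_BITS)
vi_bits = [True, False, False, False]

loaded = ramset_lookup(0x1004, BASE, True, vi_bits, LINE_BITS, INDEX_BITS)
not_loaded = ramset_lookup(0x1010, BASE, True, vi_bits, LINE_BITS, INDEX_BITS)
outside = ramset_lookup(0x2000, BASE, True, vi_bits, LINE_BITS, INDEX_BITS)
disabled = ramset_lookup(0x1004, BASE, False, vi_bits, LINE_BITS, INDEX_BITS)
```

A hit-miss result is the case in which the line-by-line fill of the RAM set would be triggered; a plain miss leaves the access to the 4-way set-associative cache.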

[0045] The hit-miss or hit-hit signal has precedence over the hit way of the 4-way set-associative cache. This implies
that any value loaded previously in the cache that should be in the RAM set is never selected and will eventually be
removed from the cache. However, this can create coherency problems in the case of modified data (copy back). Therefore,
it is recommended to write back ("clean") or even flush the range of addresses that will correspond to the RAMset range
of addresses.

[0046] The RAM set can be loaded in two ways: line-by-line fill, and complete fill/block fill. Figure 6 is an illustration
of loading a single line into the RAM set of Figure 5. When a new value is written into the full-set TAG register (base
address), all content of the RAMset is invalidated. Following the programming of the base address register, the RAM
set is going to fill itself one line at a time on every hit-miss located in the RAM set.

[0047] Figure 7 is an illustration of loading a block of lines into the RAM set of Figure 5. The block fill is based on
two additional registers called Start (CNT) and End (END). Start is an n-bit counter and End an n-bit register, for which
n depends on the size of the RAM set. If two RAM sets have different sizes, some of the top bits will be invalidated
accordingly in the control logic when the associated RAM set is selected.

[0048] Writing a value in the End register sets the RAM set control in block fill mode for the block loading. Setting
the Start address after setting the End initiates a block transfer. Setting the Start address without previously setting
the End address, or writing the same value in Start and End, simply loads the corresponding entry.

[0049] In the case of multiple RAM sets, the start address determines to which RAM set the block load is directed.
The selection of the RAM set is done by comparing the top part of the start address with the contents of the RAM set
base address and loading the bottom part in the counter (CNT). If the start address is not included inside any of the
RAM sets, the instruction behaves like a prefetch block or, respectively, a prefetch line on the cache. Depending on
the End and Start values, the block size can vary from one line to n lines.
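The Start/End behavior described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not the circuit of Figure 7: the class and method names are hypothetical, addresses are expressed in units of lines, every RAM set is assumed to hold 2^n lines, lines that are already valid are skipped, and block fill mode is assumed to end once the load has been issued.

```python
class RamSet:
    """Toy RAM set: a base (tag) register and one valid bit per line."""

    def __init__(self, base, n_lines):
        self.base = base
        self.valid = [False] * n_lines


class BlockFill:
    """Hypothetical model of the Start (CNT) and End (END) registers."""

    def __init__(self, ram_sets, index_bits):
        self.ram_sets = ram_sets
        self.index_bits = index_bits  # n: width of CNT and END
        self.end = None               # None: End not written since the last load

    def _low(self, line_addr):
        return line_addr & ((1 << self.index_bits) - 1)

    def set_end(self, line_addr):
        # Writing End selects block fill mode for the next Start write.
        self.end = self._low(line_addr)

    def set_start(self, line_addr):
        """Return (RAM set index or None, list of line indexes loaded)."""
        top = line_addr >> self.index_bits
        cnt = self._low(line_addr)
        # No End, or End equal to Start: only the Start entry is loaded.
        last = cnt if self.end is None else max(self.end, cnt)
        self.end = None  # assumption: block fill mode ends with the load
        for which, ram_set in enumerate(self.ram_sets):
            if ram_set.base == top:
                loaded = [i for i in range(cnt, last + 1) if not ram_set.valid[i]]
                for i in loaded:
                    ram_set.valid[i] = True
                return which, loaded
        return None, []  # outside every RAM set: handled as a cache prefetch
```

In this sketch, writing only Start loads a single line, writing End and then Start loads the whole range, and a Start address outside every RAM set returns no RAM set index, which stands in for the prefetch on the cache.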

[0050] Figure 8 is an illustration of interrupting a block load of the RAM set according to Figure 7 in order to load a
single line within the block. To reduce system latency, the megacell processors, referred to generically as CPU, can
still access both cache and RAMset when block loading is in progress; therefore, the following can happen:

(1) The CPU accesses a line already loaded. The CPU is served immediately or after a one-cycle stall (conflict with
a line load).

(2) The CPU accesses a line not yet loaded (hit-miss). The CPU is served after the completion of the on-going
line load.

[0051] Each line load is done in two indivisible steps: first the entry presence is checked (valid bit set); then, only if
the line is not already present in the cache or in the RAMSet, it is loaded.

[0052] Before initiating a block load by programming new values in End and Start, the status must be checked to
see that no previous block load is on-going. There is no automatic hardware CPU stall in this case, and programming
new values while a block load is on-going would cause that block load to stop. This could result in an unexpectedly long
latency in a real-time application for accesses into the RAMset in which the block load was interrupted in this manner.

Cache features 

[0053] The MEGACELL unified cache memory supports write back, and write through with or without write-allocate,
on a page basis. These controls are part of the MMU attributes. Hit under miss is supported to reduce conflicts between
requesters and consequent latency. Concurrent accesses on RAMsets and cache are supported.
[0054] On a cache miss, the segment corresponding to the miss is fetched from external memory first. For example,
if the miss occurs in the second segment, the second segment is fetched; then the third segment and finally the fourth
segment are loaded into the MEGACELL cache automatically, which is referred to as automatic hardware prefetch. The
first segment is not loaded into the cache. This sequence of loads can be interrupted on a segment boundary by a miss
caused by a request having higher priority. The interrupted load is not resumed, as the remaining segments will be
loaded if required later in response to a new miss.
[0055] Each segment has a valid bit and a dirty bit. On a write back, when a line is replaced, only the segments with
modified data are written back.
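The segment handling described above can be modeled in a few lines. The following is a minimal sketch, not the MEGACELL hardware: segments are numbered from 0 (so index 1 is the "second segment" of the example), and the `budget` parameter is a hypothetical stand-in for a higher-priority miss that stops the load on a segment boundary.

```python
SEGMENTS = 4  # four segments per entry, as in the text


class CacheLine:
    def __init__(self):
        self.valid = [False] * SEGMENTS
        self.dirty = [False] * SEGMENTS

    def fill_on_miss(self, miss_segment, budget=SEGMENTS):
        """Load the missing segment, then prefetch the following ones.

        `budget` is a hypothetical stand-in for a higher-priority miss: the
        load stops on a segment boundary once `budget` segments are loaded.
        """
        loaded = []
        for seg in range(miss_segment, SEGMENTS):  # earlier segments are skipped
            if len(loaded) == budget:
                break  # interrupted; the remaining segments are not resumed
            self.valid[seg] = True
            loaded.append(seg)
        return loaded

    def write(self, seg):
        if not self.valid[seg]:
            raise ValueError("segment not present")
        self.dirty[seg] = True

    def replace(self):
        """Evict the line; return the segments that must be written back."""
        written_back = [s for s in range(SEGMENTS) if self.valid[s] and self.dirty[s]]
        self.valid = [False] * SEGMENTS
        self.dirty = [False] * SEGMENTS
        return written_back
```

A miss in the second segment loads segments 1, 2 and 3 and leaves segment 0 invalid; on replacement only the segments that were written are returned for write back.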

[0056] RAMSet configured as a RAM offers fast memory scratchpad feature. 

[0057] In this embodiment, RAMsets do not have Task_ID and R-ID fields and a shared bit marker associated with
each line. Operations on task_ID, R-ID, and data marked as shared are limited to the cache. However, other embodiments
may harmonize the RAMset and cache.

[0058] Multi-cycle operations on the MEGACELL outer cache are non-blocking. A multi-cycle cache operation is
launched and a status bit indicates its completion. As operations can be initiated by several requesters, these operations
cannot be blocking due to real-time constraints. If one processor initiates a clean_all_task_ID or a block operation,
other requests can interleave.

[0059] The hit logic of the outer cache only uses the address field. Task-ID and R-ID are used in task operations only.
[0060] A random cache replacement strategy has been chosen for the replacement algorithm of the 4-way set
associative caches. In this embodiment, the caches do not support cache entry locking except through the RAMSet.
[0061] Table 2 includes a listing of the various cache and RAM control operations that can be invoked by the
processors in the megacell. In this embodiment, all operations on an entry operate on segments; in fact, there are four
segments per entry.



Table 2 - Cache and RAM control operations (C: operation on the cache, RS: operation on RAMset, R: operation on RAM)

Function | Scope | Software view (memory mapped/co-proc)
Flush_entry (address) | C/RS | Flush the entry whose address matches the provided address, or a range of addresses if End has been set previously. The flush-range instruction is made of two consecutive instructions: Set_End_addr(address) + Flush_entry(address).
Flush_all_entry_of_task_ID (task_ID) | C | Flush all entries matching the current taskID in the cache but not in the RAMSet
Flush_all_entry_of_R_ID (task_ID) | C | Flush all entries matching the current R_ID in the cache but not in the RAMSet
Flush_all | C | Flush all entries in the cache but not in the RAMSet
Flush_all_shared | C | Flush all entries marked as shared
Flush_all_task_ID_shared (task_ID) | C | Flush all entries matching the current taskID and marked as shared
Flush_all_task_ID_not_shared (task_ID) | C | Flush all entries matching the current taskID and marked as not shared
Clean_entry (address) | C/RS | Clean the entry whose address matches the provided address, or a range of addresses if End has been set previously. The clean-range instruction is made of two consecutive instructions: Set_End_addr(address) + Clean_entry(address).
Clean_all_entry_of_taskID (task_ID) | C | Clean all entries matching the current taskID in the cache but not in the RAMSet
Clean_all_entry_of_R_ID (task_ID) | C | Clean all entries matching the current R_ID in the cache but not in the RAMSet
Clean_all | C | Clean all entries in the cache but not in the RAMSet
Clean_all_shared | C | Clean entries marked as shared
Flush_all_task_ID_shared (task_ID) | C | Flush all entries matching the current taskID and marked as shared
Clean_all_taskID_not_shared (task_ID) | C | Clean all entries matching the current taskID and marked as not shared
Clean&Flush_single_entry (address) | C/RS | Clean and flush the entry whose address matches the provided address, or a range of addresses if End has been set previously. The clean-range instruction is made of two consecutive instructions: Set_End_addr(address) + Clean_entry(address).
Clean&flush_all_entry_of_taskID (task_ID) | C | Clean and flush all entries matching the current taskID in the cache but not in the RAMSet
Clean&flush_all_entry_of_R_ID (task_ID) | C | Clean and flush all entries matching the current R_ID in the cache but not in the RAMSet
Clean&flush_all | C | Clean and flush all entries in the cache but not in the RAMSet
Clean&flush_all_shared | C | Clean and flush entries marked as shared
Clean&flush_all_taskID_shared (task_ID) | C | Clean and flush all entries matching the current taskID and marked as shared
Clean&flush_all_taskID_not_shared (task_ID) | C | Clean and flush all entries matching the current taskID and marked as not shared
Set_RAM_Set_Base_addr (RAMSetID) | RS/R | Set a new RAM set base address, set VG, clear all VI, and set End to the last RAM set address by default, preparing the full RAM set loading. In that case there is no need to write the End address before writing the start address to load the RAM set.
Set_End_Addr (address) | C/RS | Set the end address of the next block load and set the RAM set controller in block fill mode.
Set_start_addr (address) | C/RS | Set the start address of a block and initiate the loading of this block
Flush_RAMset (RAMset_ID) | RS/R | Clear VG and all VI of the selected RAM set



[0062] Each RAMset has a RAMset base address register, RAMset-base[n]. RAMset base registers are coupled
with a logical comparison on the address for each requester.

[0063] A control register is provided for controlling access, RAMset-ctrl[n].

Bit[0]: 0 MPU master. Only the MPU can write to this register. 1 DSP master. Only the DSP can write to this register.
Bit[1]: 0/1 RAMset works as a cache or as a RAM

RAMSet master bit: each RAMset can be controlled by one or the other processor, write access to the register base
[0064] A status register provides cache information, including the number of RAM sets, their sizes, the number of cache
ways, and the line size.

Endianness 

[0065] A system with the MEGACELL will sometimes be deployed in situations that involve mixed endianness.
Some processors will be bi-endian with a specific endianness selected at reset or on a memory region basis. The
"endianness" of a processor is a property that describes the orientation of external data when it arrives at the processor's
external data bus. A processor is little (respectively, big) endian if data objects with ascending addresses will appear
at more (respectively, less) significant places on the data bus.

[0066] The endianness behavior of the Megacell is defined assuming that the addressable unit of memory is an 8-bit
byte, the width when referencing a processor's external memory interface is 32 bits, and any shifting required to access
objects smaller than 32 bits occurs inside the processor, i.e., no shifting is required between the external memory
interface and the memory.

[0067] The fundamental requirement is that external memory be connected to the processor memory interface in
such a manner that accesses to 32-bit (aligned) objects yield the same results in both big and little endian modes of
operation, whether within different tasks on a single processor or within different processors. As an example, suppose
that the 32-bit value 0xDDCCBBAA is stored in the 32-bit memory cell containing address @1000. Endian invariance
means that the data lines from the memory must be connected to the data portion of the processor's memory interface in
such a manner that 0xDD is wired to the most significant byte of the data bus and 0xAA is wired to the least significant
byte; this connection does not depend on the endianness of the processor.

[0068] Endian invariance does not extend to objects smaller than 32 bits. If the processor writes the 8-bit value 0xEE
to a location with byte address 1, then the byte overwritten in memory will be the one containing 0xBB if the processor
mode is little endian and 0xCC if it is big endian. Similarly, writing the 16-bit value 0xFFEE to location 2 will overwrite
0xDDCC if the processor mode is little endian and 0xBBAA if it is big endian. In other words, data objects smaller than
the size of the data portion of the external memory interface require positioning on the data bus that is offset from the
most significant end of the bus if the mode is big endian and from the least significant end if the mode is little endian.
These offsets are implemented in MEGACELL on a region basis (page) by conditionally complementing byte enables
based on the endianness mode included in the MMU page entry.
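The byte and half-word examples above can be checked with a short behavioral model. This is a sketch only: the function name `write_object` is hypothetical, and the code computes the bus position of the object directly rather than modeling the complemented byte enables of the MEGACELL.

```python
def write_object(word, byte_addr, value, size, big_endian):
    """Write a `size`-byte object into a 32-bit endian-invariant word.

    Returns (new_word, overwritten_value). The offset is taken from the least
    significant end of the bus in little endian mode and from the most
    significant end in big endian mode.
    """
    offset = byte_addr % 4
    shift = 8 * ((4 - size - offset) if big_endian else offset)
    mask = ((1 << (8 * size)) - 1) << shift
    overwritten = (word & mask) >> shift
    new_word = (word & ~mask) | ((value << shift) & mask)
    return new_word, overwritten


word = 0xDDCCBBAA
# 8-bit write of 0xEE to byte address 1
print(hex(write_object(word, 1, 0xEE, 1, big_endian=False)[1]))    # 0xbb
print(hex(write_object(word, 1, 0xEE, 1, big_endian=True)[1]))     # 0xcc
# 16-bit write of 0xFFEE to location 2
print(hex(write_object(word, 2, 0xFFEE, 2, big_endian=False)[1]))  # 0xddcc
print(hex(write_object(word, 2, 0xFFEE, 2, big_endian=True)[1]))   # 0xbbaa
```

For a full 32-bit aligned object the shift is zero in both modes, which is the endian invariance stated for 32-bit accesses.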

[0069] An access permission fault is generated when the MMU page Endianism does not fit with the corresponding
device Endianism.

Host interface 

[0070] The host interface allows access to MEGACELL internal and external memories. These accesses are
protected by a µTLB controlled by one of the MEGACELL processors. When an address is missing from the µTLB, it
searches the shared TLB. If a miss occurs in both, an interrupt is returned to the processor in charge of the host. Two
registers are associated with the host interface to determine the resource identifier (R-ID) and the task identifier allocated
to the host. There is also a control register in the megacell that includes a bit to select which CPU is in charge of
controlling the host.


Detailed aspects 

[0071] Various aspects of the digital system of Figure 1 will now be described in more detail.
[0072] Figure 9 is an illustration of a level 2 cache in the megacell of Figure 1 that has an interruptible prefetch
system, and thereby provides miss under miss support. As described above, the level 2 cache architecture is embodied
with 4-way associativity, four segments per entry and four valid and dirty bits. When the level 2 cache misses, the
penalty to access data within the level 3 SDRAM is high. The system supports miss under miss to let another miss
interrupt the segment prefetch. For example, when a processor P1 access to its Level 1 cache misses and the Level
2 cache misses, the Level 2 cache controller transfers one or several segments of 16 bytes from the SDRAM to the
cache line. The controller generates the address header, and one or several segments of 16 bytes can be transferred
for the same request. If, for example, segment 2 misses, then the controller fetches segment 2 and prefetches segments
3 and 4. During the miss time, other requests that hit the level 2 cache can be served. Subsequently, if a processor P2
misses the level 2 cache, the ongoing prefetch sequence for processor P1 is stopped and the P2 miss is served. Thus,
an interruptible SDRAM to L2-cache prefetch system with miss under miss support is provided.

[0073] Figure 10 illustrates concurrent access of the level 2 cache and level 2 RAM of the megacell of Figure 1. The
shared L2-SmartCache's RAM and Cache sets can be accessed concurrently. When different processors request an
access to memory space stored in different memory blocks, the system services accesses in parallel. Parallel decode
is done on RAMSets to reorder accesses not located in RAMsets on the 4-way set associative part of the SmartCache.
If a concurrent access is not possible because the two memory spaces corresponding to the requests are in the same
memory block, then the requests are served sequentially.

[0074] Figure 11A illustrates a request queue for the level 2 memory system of Figure 10. The system contains a
request queue that stores the waiting access requests from different processors. The system evaluates any possible
concurrency within the waiting list and serves accesses in parallel when different processors access different memory
blocks. Hit logic for the L2-SmartCache supports concurrent access requests.
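One way to picture the concurrency evaluation on the request queue is the greedy grouping below. It is a sketch under assumptions, not the circuitry of Figure 11B: requests are (requester, memory block) pairs, block names such as "cache" and "ramset0" are hypothetical, each requester has at most one access served per step, and requests are considered in queue order.

```python
def schedule(queue):
    """Group pending (requester, block) requests into service steps.

    Requests in the same step target different memory blocks and come from
    different requesters, so they can be served in parallel; requests to the
    same block are pushed to a later step and served sequentially.
    """
    pending = list(queue)
    steps = []
    while pending:
        used_blocks, used_requesters = set(), set()
        this_step, deferred = [], []
        for requester, block in pending:
            if block in used_blocks or requester in used_requesters:
                deferred.append((requester, block))
            else:
                used_blocks.add(block)
                used_requesters.add(requester)
                this_step.append((requester, block))
        steps.append(this_step)
        pending = deferred
    return steps
```

With P1 and P3 both targeting the cache and P2 targeting a RAM set, P1 and P2 are served together and P3 follows in the next step.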

[0075] Figure 11B is a more detailed block diagram of the level 2 memory system of Figure 10, illustrating the request
queue circuitry.

[0076] Figure 12 illustrates tag circuitry with task ID and resource ID fields in the level 2 cache of the megacell of
Figure 2. As discussed earlier, the shared multiprocessor L2_cache architecture has a task_ID field and a Resource_ID
field to identify the device using the corresponding resource and task. Adding task-ID and R-ID to the shared level 2
cache identifies all entries belonging to a task and/or to a resource. This provides improved system safety.
[0077] In a dynamic system environment, and a fortiori in a multi-processor system with a shared memory cache, it
becomes mandatory due to the cache size to have selective control over the cache to improve performance and reduce
power consumption. Task-ID and resource-ID have been added to the TAG array as a qualifier field for cache operations
such as flush (invalidate), clean or even lock/unlock. All entries of the shared system cache belonging to a task or,
respectively, to one of the system resources (CPU, coprocessors, ...) can be identified within a single cache command,
as illustrated in Table 2.

[0078] On detection of the command "flush_all_entry_related_to_task_ID" issued by the MPU, a hardware counter
is incremented to search all the L2_cache and the command flushes all entries belonging to the given task identifier
(task-ID) and/or to the given resource identifier (R-ID). At each iteration of the hardware loop, the task-ID and/or
R-ID field is compared with the task-ID and/or R-ID provided through the command.
In case of a match, the entry is flushed out. Similarly, the system supports clean and clean&flush operations based on
the task-ID field and R-ID field. This fast hardware looping mechanism is also applied to a one-bit field called "shared".
Similarly, all entries marked as "shared" can be cleaned or flushed out through a single command. The master CPU, or
any CPU in the system within its R-ID limits, can initiate these commands. Ordinary accesses, resulting from an
L1 miss, will stall these commands, which are then automatically resumed after the ordinary access is satisfied. In
another embodiment, this feature could also be applied to an L1 cache; however, the benefit would probably be lower
due to the smaller L1 cache size.
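The looping flush described above can be sketched as a software analogue of the hardware counter. The `Entry` fields and the function name are hypothetical, and the loop ignores the stalling by ordinary accesses; it only shows the per-entry comparison against the task-ID, R-ID and shared qualifiers.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    valid: bool
    task_id: int
    r_id: int
    shared: bool


def flush_matching(entries, task_id=None, r_id=None, shared=None):
    """Invalidate every valid entry that matches all supplied qualifiers.

    A qualifier left as None is not compared. Returns the number of entries
    flushed. Each loop iteration stands for one step of the hardware counter.
    """
    flushed = 0
    for entry in entries:
        if not entry.valid:
            continue
        if task_id is not None and entry.task_id != task_id:
            continue
        if r_id is not None and entry.r_id != r_id:
            continue
        if shared is not None and entry.shared != shared:
            continue
        entry.valid = False
        flushed += 1
    return flushed
```

Calling `flush_matching(entries, task_id=3)` corresponds to a flush by task, `flush_matching(entries, shared=True)` to a flush of all shared entries, and supplying both qualifiers to a flush of the shared entries of one task.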

[0079] In this embodiment, a hardware loop controlled by a single command is provided by a state machine under
control of the MPU to clean or flush all entries of a given task. In another embodiment, a similar state machine can be
under control of the DSP or an external host processor. Alternatively, control circuitry can be provided to perform the
clean or flush operation in a simultaneous manner for all entries, rather than operating in a looping manner.
[0080] Figure 13 is a block diagram illustrating monitoring circuitry within the megacell of Figure 2 to manage cleaning
and flushing based on an average miss rate measure. For large caches the penalty to clean or flush is high even if
only the entries corresponding to a task or resource are considered. If the cache is not flushed at some point in time,
the miss rate may increase. Therefore, in the current embodiment, the OS periodically monitors the miss rate counter
(Miss_CNT) and uses the miss rate to decide when to flush the entries corresponding to a task or resource recently
deleted.

[0081] Figure 14 is a block diagram illustrating a priority register in each processor of the megacell of Figure 2 for
task based priority arbitration. A priority register is associated with the task_ID register in the processor. The priority
field is exported to the traffic control logic. In a first embodiment with a simple solution with two bits only, one is set by
the hardware when an interrupt (or an exception) occurs. The OS controls both bits but an application only controls
the second bit.

[0082] In an alternative embodiment, 1+n bits are provided for the OS priority field (in general, n = 8 bits). In either
embodiment, the 2 bits or n+1 bits are used to control the priority of accesses to the Megacell shared resources.
[0083] Two fields are used to determine the access priority to the shared resources. One field comes from the
processor and carries the priority associated with the current task, and the second field comes from the MMU TLB, which
contains the priority of a given MMU page. The highest value is used for priority arbitration.
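The two-field rule above can be written as a short sketch. The function names are hypothetical, and two points are assumptions that the text does not state: a larger number is taken to mean a higher priority, and ties are resolved in favor of the request listed first.

```python
def effective_priority(task_priority, page_priority):
    """The higher of the task priority and the MMU page priority is used."""
    return max(task_priority, page_priority)


def arbitrate(requests):
    """Pick the winning requester.

    `requests` is a list of (requester, task_priority, page_priority) tuples.
    Assumption: the request with the largest effective priority wins, and on a
    tie the earliest request in the list wins.
    """
    return max(requests, key=lambda r: effective_priority(r[1], r[2]))[0]
```

For example, a request with task priority 1 on a page of priority 3 competes with an effective priority of 3.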

[0084] Figure 15 is a block diagram of the level 1 caches in the megacell of Figure 2 illustrating control circuitry for
interruptible block prefetch and clean functions. As discussed earlier, the RAMSet of the SmartCache can be managed
in chunks of contiguous memory. Standard cache operations, such as a miss resulting from a CPU read access on the
RAMSet or a clean entry, are respectively changed into a block prefetch operation or a block cleaning operation if the
end of block register has been previously programmed. These block operations can also result from the programming of
two registers, end-of-block and start-of-block. These prefetch/clean operations are interruptible on the completion of a
line to guarantee maximum latency for real-time systems. The block prefetch operation can re-use the existing hardware
used for full cleaning of the cache or have a different counter. During the block operation the CPU can be in wait, with
its activity resumed on reception of an interrupt that stops the current block operation, or the CPU can be running
concurrently with a single cycle stall during line transfer in the write/read buffer.

[0085] Thus, the present embodiment provides an interruptible prefetch block on the RAMSet using the current cache
mechanism: miss on load and clean entry after programming the end-of-block register, the CPU being in wait during
block operation.

[0086] The present embodiment provides the ability to prefetch a block on the RAMSet using the cache mechanism: miss
on load and clean entry after programming the end-of-block register, with concurrent CPU cache and/or RAMSet access.
[0087] The present embodiment performs both of the above using an end-of-block register and a start-of-block
register to initiate block operation (initial value of the block counter).

[0088] The present embodiment also extends the interruptible prefetch/save block scheme to the cache with no
boundary limit between cache and RAMSet. This is the same as a cache operation based on an operation on a range of
addresses.

[0089] Figure 16 is a block diagram of the cache of Figure 15 illustrating a source/destination register for DMA
operation. The RAMSet of the SmartCache is configured as a local memory with DMA support provided by the cache
hardware logic or a separate DMA logic. The SmartCache commands are used identically in both modes. A register
is added to provide the destination/source address that enables re-allocation of data/instructions during transfer from/
to external memory. The existing valid bits of the RAMSet are used to monitor the DMA progress, which allows the
CPU to have access to the RAMSet concurrently with the DMA operation, including within the range of addresses of the
DMA. Thus, identical control for local memory working as a cache (RAMSet) or as a local memory with DMA is provided.
[0090] Figure 17 illustrates operation of the cache of Figure 16 using only a global valid bit for DMA completion
status. The RAMSet of the SmartCache is configured as a local memory with DMA support provided by the cache
hardware logic or a separate DMA logic. The SmartCache commands are used identically in both modes. A register
is added to provide the destination/source address that enables re-allocation of the data/instructions during transfer
from/to external memory. The valid bits of the RAMset are not used; the DMA progress is simply monitored with the
status bit that indicates its completion. Thus, concurrent access on the cache or on both cache and RAMSet is provided,
except in the DMA range during DMA on the RAMSet.

[0091] Figure 18 illustrates operation of the cache of Figure 15 in which a block of lines is cleaned or flushed.
Programming the "end of block" register changes a cache operation such as clean or flush active on a single entry to an
operation on a block of lines located between this entry and the entry pointed to by the "end of block" register. The function
can also be implemented using an "end-of-block" register and a "start-of-block" register (initial value of the block counter).
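The effect of the "end of block" register on a single-entry clean can be sketched as follows. The class and method names are hypothetical; the model tracks only dirty bits, assumes the end entry is not below the start entry, and assumes the register is consumed by the operation that uses it.

```python
class BlockCleaner:
    """Toy cache in which each line has only a dirty bit."""

    def __init__(self, n_lines):
        self.dirty = [False] * n_lines
        self.end_of_block = None  # None: register not programmed

    def set_end_of_block(self, line):
        self.end_of_block = line

    def clean_entry(self, line):
        """Clean one line, or the block [line, end_of_block] if End is set.

        Returns the lines that were actually written back.
        """
        last = line if self.end_of_block is None else max(line, self.end_of_block)
        self.end_of_block = None  # assumption: End is consumed by the operation
        cleaned = [i for i in range(line, last + 1) if self.dirty[i]]
        for i in cleaned:
            self.dirty[i] = False
        return cleaned
```

Without the register programmed, `clean_entry(2)` touches only line 2; after `set_end_of_block(5)` the same call walks lines 2 to 5.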

[0092] Figure 19 illustrates register and arbitration circuitry in the cache of Figure 15 to support local memory with
DMA operation simultaneously with RAM set operation in the same RAM set. This includes register and arbitration
logic to support local memory with DMA and RAMSet behavior simultaneously on the same RAMSet. One part of the
RAMSet behaves as a RAMSet; the other part behaves as a local memory with DMA. There is one additional base
address register (base-DMA) to indicate the beginning of the section of the RAMSet behaving as a local memory with
DMA. As this is a working area, only one register is needed to split the RAMSet into two parts.

[0093] Figure 20 illustrates use of a local valid bit to support concurrent DMA and CPU access in the cache of Figure
15. The local memory is segmented into lines with individual valid bits, enabling a CPU to access any line outside or inside
the DMA range concurrently while the DMA transfer is ongoing. If a CPU is accessing a line ahead of the DMA, the
DMA is momentarily stalled to load the line accessed by the CPU and the DMA is then resumed. Prior to loading a line,
the DMA engine checks the valid bit to avoid overwriting a valid line, which would have been loaded ahead of the DMA
execution in response to a CPU access.
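The interaction between the DMA engine and a CPU access ahead of it can be modeled with per-line valid bits. This is a sketch with hypothetical names; it records who caused each line load so the ordering is visible, and it does not model timing or the stall itself.

```python
class LocalMemory:
    """Local memory with one valid bit per line and a log of line loads."""

    def __init__(self, n_lines):
        self.valid = [False] * n_lines
        self.loads = []  # (initiator, line) in the order the loads happened

    def _load(self, initiator, line):
        self.valid[line] = True
        self.loads.append((initiator, line))

    def cpu_access(self, line):
        # A CPU access to a line not yet transferred loads it immediately;
        # in hardware the DMA is momentarily stalled while this happens.
        if not self.valid[line]:
            self._load("cpu", line)

    def dma_step(self, line):
        """Transfer one line unless it is already valid. Returns True if loaded."""
        if self.valid[line]:
            return False  # loaded earlier for the CPU: do not overwrite it
        self._load("dma", line)
        return True
```

If the CPU touches line 2 while the DMA has only transferred line 0, line 2 is loaded for the CPU and the DMA later skips it.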

[0094] Figure 21 illustrates operation of the TLB of Figure 3 for selective flushing of an entry for a given task or
resource. A "resource ID" and/or a task-ID field stored as independent fields in the TLB TAG array is used to selectively
flush all entries of a given task or a given resource (requester). Thus, a single operation is provided to flush all entries
of a given task located in a TLB. In this embodiment, the TLB cache is made of several levels of TLB, and all levels
are flushed simultaneously.

[0095] The TLB structure includes a field identifying the processing resource or memory access requestor (R_ID).
This "resource ID" field is part of the TLB TAG array, to enable requestor-selective operations (such as flushes). This
does, for instance, permit flushing all entries related to a processor that will be shut down for energy savings.

[0096] Figure 22 illustrates control circuitry for adaptive replacement of TLB entries in the TLB of Figure 3. In this
multi-processor system with a system shared TLB, the need has arisen to control the TLB on a task basis. The function
"Lock/unlock all entries of a given task" is provided by the comparison of the task-id field in the TLB. If this field matches
the supplied task-id, the associated Lock bit of the matching entry is cleared. In a TLB implemented with a content
addressable memory (CAM), all entries with the same task-ID are unlocked in one cycle; in a TLB implemented with
a RAM, the function is done through a hardware loop.
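For the RAM-based case, the task-wide lock/unlock amounts to a loop over the entries. The sketch below uses a hypothetical dictionary layout for a TLB entry and a single function for both directions; a CAM implementation would perform the same comparison on all entries at once.

```python
def set_lock_for_task(entries, task_id, locked):
    """Set or clear the lock bit of every entry whose task-id matches.

    `entries` is a list of dicts with 'task_id' and 'lock' keys (a toy TLB).
    Returns the number of entries whose task-id matched.
    """
    matched = 0
    for entry in entries:  # one iteration per step of the hardware loop
        if entry["task_id"] == task_id:
            entry["lock"] = locked
            matched += 1
    return matched
```

Unlocking all entries of task 5 is then `set_lock_for_task(tlb, 5, False)`.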

[0097] In order to support such a function in the most optimized way, an adaptive replacement algorithm taking into
account locked entries and empty entries is provided. When the TLB is full, random replacement based on a simple
counter (Victim CNT) is used to select the victim entry. On a miss, the lock bit of the victim entry is checked; if it is
locked, the victim counter is incremented further in the background of the table walk until a non-locked entry is found.
When the TLB is not full, the victim counter is incremented further until an empty entry is found. After a flush entry, the
victim counter is updated with the location value of the flushed entry and stays unchanged until a new line is loaded,
in order to avoid unnecessary searching. A second implementation provides the capability to do the search
instantaneously by providing the lock and valid bits in external logic.
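The victim search described above can be sketched with a deterministic counter. The names are hypothetical, and several points are assumptions: the counter here simply advances past the entry just loaded rather than running freely, a flush also clears the lock bit, and the search gives up after one full pass if every entry is locked.

```python
class Tlb:
    def __init__(self, n_entries):
        self.valid = [False] * n_entries
        self.locked = [False] * n_entries
        self.victim = 0  # Victim CNT

    def pick_victim(self):
        """Advance the victim counter until a usable entry is found."""
        n = len(self.valid)
        full = all(self.valid)
        for _ in range(n):
            i = self.victim
            # Full TLB: skip locked entries. Not full: skip occupied entries.
            usable = (not self.locked[i]) if full else (not self.valid[i])
            if usable:
                return i
            self.victim = (i + 1) % n
        return None  # every entry is locked

    def load(self):
        i = self.pick_victim()
        if i is None:
            raise RuntimeError("all entries are locked")
        self.valid[i] = True
        self.locked[i] = False
        self.victim = (i + 1) % len(self.valid)  # assumption: move on after a load
        return i

    def flush(self, i):
        self.valid[i] = False
        self.locked[i] = False  # assumption: a flushed entry is no longer locked
        self.victim = i         # next load reuses this slot without searching
```

After a flush, the next `load()` lands directly on the flushed slot, which is the "avoid unnecessary searching" behavior of the text.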

[0098] Thus, a lock/unlock operation on the TLB based on task-ID is provided. A random replacement algorithm for
the TLB is changed into a cyclic one on empty entry detection and locked victim entry detection.

[0099] Still referring to Figure 22, the TLB TAG includes a one-bit field (S/P) indicating if the corresponding address
or page is shared or private. All entries marked as shared can be flushed in one cycle globally or within a task.
[0100] Figure 23 is a block diagram of control circuitry in the megacell of Figure 2 for dynamic control of power
management systems using task attributes. A dynamic system power/energy management scheme based on hardware
control via run-time and task attribute registers is provided. On a given processor, whenever a context switch occurs,
the operating system loads a current task ID register (Current Task ID), task priority and attributes. The attribute
register can contain a control bit for each major block of the CPU subsystem or the overall system. The supply voltage
of each block can be defined according to the current task requirement. Some attributes can also be fixed at run-time.
One or multiple power control registers can be loaded with power attributes by a processor each time this task is
scheduled on this processor (task attributes), or each time a new scenario is built for the processor or the whole system
(run-time attributes).

[0101] Figure 24 illustrates dynamic hardware configuration of the megacell of Figure 2 using task attributes. A
dynamic way to reconfigure a hardware logic module for a given task, according to its resource requirements, is provided
in this embodiment. One or more configuration words are written into a register (attrib), a memory or a programmable
control structure (FPLA) by a processor each time its operating system switches to a new task. This permits reuse of
complex hardware logic for multiple functions, and also dynamically optimizes performance and energy consumption of
this logic for a broader application range.

[0102] Figure 25 illustrates task based event profiling to perform task scheduling for control of power dissipation
within the system of Figure 1. A way to measure system energy consumed by a given task is provided. This measure
is performed through a set of HW counters (event counters) triggered by a task ID. Each counter records activity
associated with a specific region of the megacell that can be correlated with power consumption, such as signal
transitions on a bus, for example. In order to profile a given task, the counters are enabled only when the given task is
active, as indicated by the task ID register.
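The gating of an event counter by the task ID register can be sketched as follows. The class name and the event stream are hypothetical; each call stands for activity observed while a given task ID is held in the task ID register.

```python
class TaskEventCounter:
    """Event counter that is enabled only while the profiled task is active."""

    def __init__(self, profiled_task_id):
        self.profiled_task_id = profiled_task_id
        self.count = 0

    def record(self, current_task_id, events=1):
        # The comparison with the task ID register acts as the counter enable.
        if current_task_id == self.profiled_task_id:
            self.count += events


counter = TaskEventCounter(profiled_task_id=7)
for task_id, bus_transitions in [(7, 3), (2, 10), (7, 1)]:
    counter.record(task_id, bus_transitions)
print(counter.count)  # 4: activity of task 2 is not attributed to task 7
```

One such counter per monitored region would give a per-task activity profile that can then be correlated with power consumption.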

[0103] Figure 26 illustrates operation of the level 2 TLB of Figure 2 while being shared between different operating
systems. In a unified memory management context, a TLB is shared between different OS environments to accelerate
address translation of different processors accessing a common memory environment, and to ensure protection of this
environment. The term "processor", in this case, designates any access requestor to the memory environment. It can
be a microprocessor unit, a DSP, an application specific processor (ASP), a coprocessor or a DMA controller. This permits
efficient use of common hardware table entries in a buffer between different processors that will typically have
heterogeneous needs in terms of frequently accessed address pages.

[0104] This shared TLB concept is extended in a multi-level TLB hierarchy system of the present embodiment, where
each processor has its own micro-TLB whose page faults are directed to the shared TLB.

[0105] Figure 27 illustrates operation of the level 1 TLB of Figure 2 while being shared between different memory
access requestors operating under a common operating system. A TLB is shared between different processing
resources or memory access requestors (processor, coprocessor, DMA) in a common OS environment to accelerate
address translation when accessing a common memory environment, and to ensure protection of this environment.
[0106] This permits efficient use of common hardware between different hardware resources that will typically have
heterogeneous needs in terms of frequently accessed address pages.

[0107] This shared TLB concept is also used in the multi-level TLB hierarchy system of this embodiment, where
each processor has its own micro-TLB, whose page faults are directed to the shared TLB.

Digital System Embodiment 


[0108] Figure 28A illustrates an exemplary implementation of such an integrated circuit in a mobile
telecommunications device, such as a mobile telephone with integrated keyboard 12 and display 14. As shown in
Figure 28, the digital system 10 with a megacell according to Figure 2 is connected to the keyboard 12, where
appropriate via a keyboard adapter (not shown), to the display 14, where appropriate via a display adapter (not shown), and
to radio frequency (RF) circuitry 16. The RF circuitry 16 is connected to an aerial 18.

[0109] Figure 28B is a block diagram representation of the telecommunications device of Figure 28A. Specifically,
Figure 28B illustrates the construction of a wireless communications system, namely a digital cellular telephone handset
226. It is contemplated, of course, that many other types of communications systems and computer systems may also
benefit from the present invention, particularly those relying on battery power. Examples of such other computer
systems include personal digital assistants (PDAs), portable computers, and the like. As power dissipation is also of
concern in desktop and line-powered computer systems and micro-controller applications, particularly from a reliability
standpoint, it is also contemplated that the present invention may also provide benefits to such line-powered systems.
[0110] Handset 226 includes microphone M for receiving audio input, and speaker S for outputting audible output, 
in the conventional manner. Microphone M and speaker S are connected to audio interface 228 which, in this example,
converts received signals into digital form and vice versa. In this example, audio input received at microphone M is
processed by filter 230 and analog-to-digital converter (ADC) 232. On the output side, digital signals are processed 
by digital-to-analog converter (DAC) 234 and filter 236, with the results applied to amplifier 238 for output at speaker S. 
[0111] The output of ADC 232 and the input of DAC 234 in audio interface 228 are in communication with digital
interface 240. Digital interface 240 is connected to micro-controller 242 and to digital signal processor (DSP) 190.
Micro-controller 242 and DSP 190 are implemented in a megacell such as illustrated in Figure 2 and include the
various aspects disclosed herein.

[0112] Micro-controller 242 controls the general operation of handset 226 in response to input/output devices 244,
examples of which include a keypad or keyboard, a user display, and add-on cards such as a SIM card. Micro-controller
242 also manages other functions such as connection, radio resources, power source monitoring, and the like. In this 
regard, circuitry used in general operation of handset 226, such as voltage regulators, power sources, operational 
amplifiers, clock and timing circuitry, switches and the like are not illustrated in Figure 28B for clarity; it is contemplated 
that those of ordinary skill in the art will readily understand the architecture of handset 226 from this description. 

[0113] In handset 226, DSP 190 is connected on one side to interface 240 for communication of signals to and from
audio interface 228 (and thus microphone M and speaker S), and on another side to radio frequency (RF) circuitry
246, which transmits and receives radio signals via antenna A. Conventional signal processing performed by DSP 190
may include speech coding and decoding, error correction, channel coding and decoding, equalization, demodulation,
encryption, voice dialing, echo cancellation, and other similar functions to be performed by handset 226.

[0114] RF circuitry 246 bidirectionally communicates signals between antenna A and DSP 190. For transmission,
RF circuitry 246 includes codec 248 that codes the digital signals into the appropriate form for application to modulator
250. Modulator 250, in combination with synthesizer circuitry (not shown), generates modulated signals corresponding
to the coded digital audio signals; driver 252 amplifies the modulated signals and transmits the same via antenna A.
Receipt of signals from antenna A is effected by receiver 254, which applies the received signals to codec 248 for
decoding into digital form, application to DSP 190, and eventual communication, via audio interface 228, to speaker S.
[0115] Fabrication of the digital systems disclosed herein involves multiple steps of implanting various amounts of 
impurities into a semiconductor substrate and diffusing the impurities to selected depths within the substrate to form
transistor devices. Masks are formed to control the placement of the impurities. Multiple layers of conductive material
and insulative material are deposited and etched to interconnect the various devices. These steps are performed in a
clean room environment.

[0116] A significant portion of the cost of producing the data processing device involves testing. While in wafer form,
individual devices are biased to an operational state and probe tested for basic operational functionality. The wafer is
then separated into individual dice which may be sold as bare die or packaged. After packaging, finished parts are
biased into an operational state and tested for operational functionality. 

[0117] The digital systems disclosed herein contain hardware extensions for advanced debugging features. These
assist in the development of an application system. Since these capabilities are part of the megacell itself, they are
available utilizing only a JTAG interface with extended operating mode extensions. They provide simple, inexpensive,
and speed independent access to the core for sophisticated debugging and economical system development, without
requiring the costly cabling and access to processor pins required by traditional emulator systems or intruding on
system resources.

[0118] As used herein, the terms "applied," "connected," and "connection" mean electrically connected, including 
where additional elements may be in the electrical connection path. "Associated" means a controlling relationship, 
such as a memory resource that is controlled by an associated port. The terms assert, assertion, de-assert,
de-assertion, negate and negation are used to avoid confusion when dealing with a mixture of active high and active
low signals. Assert and assertion are used to indicate that a signal is rendered active, or logically true. De-assert,
de-assertion, negate, and negation are used to indicate that a signal is rendered inactive, or logically false.
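
The terminology can be made concrete with a small Python model. This is only a sketch of the convention: the `Signal` class, its method names, and the signal names used with it are hypothetical and do not correspond to any element of the figures.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    """One-bit signal with a declared polarity (hypothetical helper)."""
    name: str
    active_high: bool = True
    level: int = field(default=0, init=False)  # electrical level on the wire: 0 or 1

    def __post_init__(self) -> None:
        # Start inactive, whatever the polarity.
        self.deassert()

    def assert_(self) -> None:
        """Render the signal active (logically true)."""
        self.level = 1 if self.active_high else 0

    def deassert(self) -> None:
        """Render the signal inactive (logically false)."""
        self.level = 0 if self.active_high else 1

    negate = deassert  # "negate" is used as a synonym for "de-assert"

    @property
    def is_asserted(self) -> bool:
        return self.level == (1 if self.active_high else 0)
```

Asserting an active-high signal drives its level to 1, while asserting an active-low signal drives its level to 0; in both cases `is_asserted` reads true, which is the point of the vocabulary.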
[0119] While the invention has been described with reference to illustrative embodiments, this description is not 
intended to be construed in a limiting sense. Various other embodiments of the invention will be apparent to persons 
skilled in the art upon reference to this description. 

[0120] It is therefore contemplated that the appended claims will cover any such modifications of the embodiments
as fall within the true scope and spirit of the invention. 



Claims 


1. A digital system comprising:

a plurality of processors;

a plurality of private level 1 caches, each associated with a respective one of the plurality of processors;

a shared level 2 cache having a plurality of segments per entry connected to transfer a data segment to each
private level 1 cache;

a level 3 physical memory connected to provide a plurality of data segments to the shared level 2 cache,
wherein the shared level 2 cache is operable to request transfer of a first plurality of segments in response to
a first miss in a first private level 1 cache; and

wherein the shared level 2 cache is operable to stop transferring the first plurality of segments and to start
transferring a second plurality of segments in response to a second miss in a second private level 1 cache.

2. The digital system of Claim 1, wherein the shared level 2 cache comprises a plurality of tag entries, wherein each
tag entry has a resource ID field. 

3. The digital system according to Claim 2, wherein each tag entry has a task ID field. 

4. The digital system according to any of Claims 2-3, further comprising a shared translation lookaside buffer (TLB),
wherein the TLB has a plurality of page entries, and wherein each page entry has a resource ID field. 

5. The digital system according to any of Claims wherein each page entry has a task ID field. 

6. The digital system according to any of Claims 1-5, wherein the shared level 2 cache comprises a portion that is
configurable as a RAMset, wherein the RAMset is operable to load a block of segments in an interruptible manner.

7. The digital system according to Claim 6, wherein the shared level 2 cache comprises control circuitry that can be 
configured to operate in DMA mode. 






8. The digital system according to any of Claims 4-7, wherein each page entry of the shared TLB has an endianness field.



9. The digital system according to any of Claims 1-8 further comprising configuration circuitry associated with at least
a first one of the plurality of processors, wherein the configuration circuitry is responsive to a task ID value to select
an operating parameter for the first processor.

10. The digital system according to any of Claims 1-9 being a cellular telephone, further comprising: 

an integrated keyboard connected to the CPU via a keyboard adapter; 
a display, connected to the CPU via a display adapter;

radio frequency (RF) circuitry connected to the CPU; and 
an aerial connected to the RF circuitry. 
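
By way of illustration only, the interruptible segment transfer recited in Claim 1 can be sketched as a toy Python model. The class name `L2Prefetcher`, the one-segment-per-step timing, and the wrap-around fetch order starting at the missed segment are assumptions made for the example; only the four segments and four valid bits per entry are taken from the abstract.

```python
from typing import Dict, List, Optional, Tuple

SEGMENTS_PER_ENTRY = 4  # per the abstract: four segments and four valid bits per entry


class L2Prefetcher:
    """Toy model: one segment arrives from level 3 memory per call to step()."""

    def __init__(self) -> None:
        self.valid: Dict[int, List[bool]] = {}    # entry -> per-segment valid bits
        self.pending: List[Tuple[int, int]] = []  # (entry, segment) still to transfer

    def miss(self, entry: int, segment: int) -> None:
        """Start a transfer for a miss, dropping any prefetch still in flight."""
        bits = self.valid.setdefault(entry, [False] * SEGMENTS_PER_ENTRY)
        # Assumption: missed segment first, then the rest of the entry in wrap-around order.
        order = [(segment + i) % SEGMENTS_PER_ENTRY for i in range(SEGMENTS_PER_ENTRY)]
        # Replacing the pending list is what lets a second miss interrupt the first prefetch.
        self.pending = [(entry, s) for s in order if not bits[s]]

    def step(self) -> Optional[Tuple[int, int]]:
        """Transfer one pending segment and mark it valid; None when idle."""
        if not self.pending:
            return None
        entry, seg = self.pending.pop(0)
        self.valid[entry][seg] = True
        return entry, seg
```

In this model a miss on entry 7 followed, two steps later, by a miss on entry 9 leaves entry 7 with only the segments that had already arrived marked valid, and the transfer continues with the segments of entry 9.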